Parle: parallelizing stochastic gradient descent

Authors

  • Pratik Chaudhari
  • Carlo Baldassi
  • Riccardo Zecchina
  • Stefano Soatto
  • Ameet Talwalkar
Abstract

We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4× faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations.
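
As a rough illustration of the communication pattern the abstract describes (many local steps on each client, rare exchanges with a parameter server), the sketch below runs several clients on a toy least-squares problem. The loss, the step sizes, the elastic pull coefficient rho, and the server's averaging rule are illustrative assumptions for this example, not Parle's actual update equations.

    # Hedged sketch: clients take many local SGD steps between rare exchanges
    # with the server; the toy loss and the coupling/averaging constants are
    # assumptions, not the paper's exact rule.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(256, 10))          # toy least-squares data standing in
    b = rng.normal(size=256)                # for a deep network's loss

    def stochastic_grad(x, batch=32):
        idx = rng.integers(0, A.shape[0], size=batch)
        Ai, bi = A[idx], b[idx]
        return Ai.T @ (Ai @ x - bi) / batch

    n_clients, local_steps = 4, 50          # many local steps => infrequent communication
    lr, rho = 0.01, 0.1                     # rho: assumed strength of the pull toward the server
    server = np.zeros(10)
    clients = [server.copy() for _ in range(n_clients)]

    for round_ in range(20):
        for x in clients:
            for _ in range(local_steps):    # local computation dominates each round
                x -= lr * (stochastic_grad(x) + rho * (x - server))
        # rare synchronization: the server moves toward the clients' average
        server += 0.5 * (np.mean(clients, axis=0) - server)

    print("client/server gap:", np.linalg.norm(np.mean(clients, axis=0) - server))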

Similar articles

Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms

Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems. Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without ...
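
The snippet below is a generic illustration of the split described above between a sequential user-written update and an execution engine that distributes it. It is not Splash's actual programming interface; the names sgd_update and run_parallel and the shard-and-average strategy are assumptions made purely for the example.

    # Generic illustration (NOT Splash's API): the user writes a sequential
    # per-sample update, and a stand-in engine shards the data and combines
    # the resulting models.
    import numpy as np

    def sgd_update(w, sample, lr=0.05):
        """Sequential logic the user writes: one stochastic step on one sample."""
        x, y = sample
        return w - lr * (w @ x - y) * x

    def run_parallel(update, w0, data, n_workers=4):
        """Stand-in execution engine: shard the data, run the sequential update
        on each shard (conceptually on separate nodes), then average the models."""
        shards = np.array_split(np.arange(len(data)), n_workers)
        models = []
        for shard in shards:
            w = w0.copy()
            for i in shard:
                w = update(w, data[i])
            models.append(w)
        return np.mean(models, axis=0)

    rng = np.random.default_rng(2)
    w_true = rng.normal(size=3)
    data = [(x, x @ w_true) for x in rng.normal(size=(400, 3))]
    print("error:", np.linalg.norm(run_parallel(sgd_update, np.zeros(3), data) - w_true))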

SpeeDO: Parallelizing Stochastic Gradient Descent for Deep Convolutional Neural Network

Convolutional Neural Networks (CNNs) have achieved breakthrough results on many machine learning tasks. However, training CNNs is computationally intensive. When the size of training data is large and the depth of CNNs is high, as typically required for attaining high classification accuracy, training a model can take days and even weeks. In this work, we propose SpeeDO (for Open DEEP learning ...

Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work sharply analyzes: (1) mini-batching, a method of averaging many samples of the gradient both to reduce the variance of a stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method involving averaging the final few i...
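
A minimal sketch of the two averaging schemes on a toy streaming least-squares problem may help; the problem, the constants, and the choice of averaging the last half of the iterates are assumptions for illustration, not the analysis in the paper.

    # Illustration of mini-batching and tail-averaging on a toy problem;
    # constants are assumptions made for the example.
    import numpy as np

    rng = np.random.default_rng(1)
    w_true = rng.normal(size=5)

    def sample(n):                          # fresh mini-batch from the data stream
        X = rng.normal(size=(n, 5))
        return X, X @ w_true + 0.1 * rng.normal(size=n)

    w, lr, T, batch = np.zeros(5), 0.05, 2000, 16
    iterates = []
    for t in range(T):
        # (1) mini-batching: average `batch` per-sample gradients; the per-sample
        # gradients are independent, so they could be computed on parallel workers.
        X, y = sample(batch)
        w = w - lr * (X.T @ (X @ w - y)) / batch
        iterates.append(w)

    # (2) tail-averaging: average only the final iterates (here, the last half).
    tail = np.mean(iterates[T // 2:], axis=0)
    print("last iterate error :", np.linalg.norm(w - w_true))
    print("tail-averaged error:", np.linalg.norm(tail - w_true))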

Multi-GPU Training of ConvNets

In this work, we consider a standard architecture [1] trained on the Imagenet dataset [2] for classification and investigate methods to speed convergence by parallelizing training across multiple GPUs. In this work, we used up to 4 NVIDIA TITAN GPUs with 6GB of RAM. While our experiments are performed on a single server, our GPUs have disjoint memory spaces, and just as in the distributed setti...

Parallelizing Big Data Machine Learning Algorithms with Model Rotation

This paper investigates a novel approach to parallelization of machine learning algorithms using model rotation as an effective parallel computation model. We identify the importance of model rotation owing to its ability to shift the latest model updates to a neighboring computation, thereby guaranteeing model consistency which is hard to achieve in other computation models. We distinguish com...
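
A hedged sketch of the rotation idea described above, simulated sequentially: the model is split into disjoint slices, each worker updates only the slice it currently holds against its local data, and slices rotate around a ring so no two workers ever write the same parameters. The loss, sizes, and update rule are assumptions made for the example.

    # Sequential simulation of model rotation: disjoint model slices rotate
    # among workers, so writes never collide; constants are illustrative.
    import numpy as np

    rng = np.random.default_rng(3)
    n_workers, dim, lr = 4, 8, 0.05
    w_true = rng.normal(size=dim)

    # One local dataset per worker.
    local_data = []
    for _ in range(n_workers):
        X = rng.normal(size=(100, dim))
        local_data.append((X, X @ w_true))

    slices = np.array_split(np.arange(dim), n_workers)   # disjoint model partitions
    model = np.zeros(dim)

    for epoch in range(30):
        for shift in range(n_workers):
            # In rotation `shift`, worker k holds slice (k + shift) mod n_workers.
            for k in range(n_workers):
                idx = slices[(k + shift) % n_workers]
                X, y = local_data[k]
                grad = X.T @ (X @ model - y) / len(y)
                model[idx] -= lr * grad[idx]              # k writes only its current slice

    print("error after rotations:", np.linalg.norm(model - w_true))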

Journal title:
  • CoRR

Volume abs/1707.00424  Issue

Pages  -

Publication date 2017